HOW TO MEASURE IMPROVEMENT OF A SIMULATION MODEL
ALONG A DIMENSION OF LINGUISTIC COMPREHENSION
COLBY, HILF, WITTNER, PARKISON, FAUGHT
To measure improvement one needs a scaled dimension and a
target value on that dimension to strive for. In a previous
communication (Colby and Hilf, 1974) a method was described for using
judges to rate a paranoid simulation model's performance along a
variety of dimensions. The judges consisted of randomly selected
psychiatrists who rated transcripts of interviews conducted in
natural language by other psychiatrists with paranoid patients and
with versions of the model (PARRY1). The interviewers and the raters
did not know that one of the interviewees was a computer simulation
of paranoid processes.
One of the rated dimensions was linguistic noncomprehension.
(The negation "non" was used to keep the ratings consistent with
other ratings being made at the same time.) A judge rated each I-O
pair of an interview along this dimension on a scale of 0-9. The
judges proved to be reliable [Frank- concordance scores here on this
dimension]. The mean score received by the patients was 0.74 and by
the model 2.22. The difference between the two mean ratings is
significant at better than the 0.001 level.
Close study of the reasons for this difference revealed that
the model recognized topics in the natural language input but did not
sufficiently recognize exactly what was being said about a topic. The
pattern-recognition processes of the model failed to pick up
sufficient information about a topic to give a reply indicating
comprehension. The power of a pattern-matching approach in language
recognition is its ability to ignore both what it recognizes to be
irrelevant and what it does not recognize at all. Its weakness lies
in not having enough patterns to match the tremendous variety of
expressions found in natural language dialogues.
To improve the language-recognition processes of the model
we designed several additional techniques which we shall only outline
here. A complete description of them can be found in Colby, Parkison
and Faught (1974).
In brief, the language-recognizing module of the current
paranoid model (PARRY2) progressively transforms the input until
a pattern is achieved which completely or fuzzily matches a more
abstract stored pattern. (See the flow diagram of Fig. 1). The
input expression is first preprocessed by translating words and
word groups (such as idioms) into internal synonyms that serve as
our names for word classes. Words not in the recognizer's dictionary
are not included in the pattern being formed. Misspellings are
corrected, groups of words are contracted into single words, and
certain expansions are made (e.g. "dont" becomes "do not"). The
pattern is then bracketed into shorter, more manageable units
termed "segments". The resultant pattern is classified as "simple",
containing no delimiters, or "complex", consisting of two or more
simple patterns.
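To make the preprocessing and segmentation stage concrete, here is
a minimal sketch in present-day Python. The dictionary entries,
delimiter set, and function names are our own invented miniatures;
the actual model's tables were far larger and its code was not
written in Python.

    # Sketch of PARRY2-style preprocessing and segmentation.
    # All table entries below are invented miniature examples.

    SYNONYMS = {                  # words/idioms -> word-class names
        "cops": "POLICE",
        "police": "POLICE",
        "horse races": "HORSERACING",   # word group -> single unit
        "afraid": "FEAR",
        "scared": "FEAR",
    }
    SPELLING = {"pollice": "police"}    # misspelling corrections
    EXPANSIONS = {"dont": "do not", "wont": "will not"}
    DELIMITERS = {"and", "but", "because", "if"}

    def preprocess(text):
        """Translate the input into internal synonyms; words not
        in the dictionary are left out of the pattern."""
        words = text.lower().split()
        fixed = []
        for w in words:
            w = SPELLING.get(w, w)                      # spelling
            fixed.extend(EXPANSIONS.get(w, w).split())  # "dont" -> "do not"
        pattern, i = [], 0
        while i < len(fixed):
            pair = " ".join(fixed[i:i + 2])   # contract word groups
            if pair in SYNONYMS:
                pattern.append(SYNONYMS[pair])
                i += 2
            elif fixed[i] in SYNONYMS:
                pattern.append(SYNONYMS[fixed[i]])
                i += 1
            elif fixed[i] in DELIMITERS:
                pattern.append(fixed[i])
                i += 1
            else:
                i += 1                        # unknown word: dropped
        return pattern

    def segment(pattern):
        """Bracket the pattern into segments at delimiters; one
        segment is classified "simple", two or more "complex"."""
        segments, current = [], []
        for unit in pattern:
            if unit in DELIMITERS:
                if current:
                    segments.append(current)
                    current = []
            else:
                current.append(unit)
        if current:
            segments.append(current)
        return segments, ("simple" if len(segments) <= 1 else "complex")

On the input "I am scared because the cops watch me", this sketch
yields the segments [["FEAR"], ["POLICE"]] and the classification
"complex"; the word "watch", absent from the miniature dictionary,
is dropped from the pattern.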
The algorithm then attempts a complete match of the
segments with stored simple patterns. When a match is found, the
stored pattern points to the name of a response function in
"memory" which decides what to do next. If a match is not found, a fuzzy
match is tried by dropping elements in a segment one at a time
and trying for a match each time. In the case of complex patterns
this one-at-a-time dropping is carried out at the segment level. If
these methods do not produce a match, a default condition obtains
and the response module decides what to do.
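A corresponding sketch of the matching step follows, again with
invented stored patterns and response-function names. Simple
patterns drop elements one at a time; complex patterns carry out
the same dropping at the segment level.

    # Sketch of complete and fuzzy matching; the stored patterns
    # and response-function names are invented stand-ins.

    STORED_SIMPLE = {
        ("FEAR", "POLICE"): "respond_fear_of_police",
        ("POLICE", "FOLLOW"): "respond_being_followed",
    }
    STORED_COMPLEX = {
        (("FEAR",), ("POLICE", "FOLLOW")): "respond_persecution",
    }

    def fuzzy_lookup(items, table):
        """Try a complete match; failing that, drop elements one
        at a time, trying for a match after each drop."""
        key = tuple(items)
        if key in table:                        # complete match
            return table[key]
        for i in range(len(items)):             # fuzzy match
            key = tuple(items[:i] + items[i + 1:])
            if key in table:
                return table[key]
        return None

    def recognize(segments):
        """Match a simple pattern element by element; for complex
        patterns the one-at-a-time dropping happens at the segment
        level. Failing both, the default condition obtains and the
        response module decides what to do."""
        if len(segments) == 1:
            fn = fuzzy_lookup(segments[0], STORED_SIMPLE)
        else:
            fn = fuzzy_lookup([tuple(s) for s in segments],
                              STORED_COMPLEX)
        return fn or "default_response"

For example, recognize([["FEAR", "POLICE", "LATELY"]]) fails the
complete match but succeeds fuzzily after dropping "LATELY",
returning "respond_fear_of_police".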
For this language-recognition strategy to be
successful, a large number of words and word-combinations
must be recognized and converted into patterns which match
stored patterns. In the first experiment to be described, there
were 1900 dictionary entries and about 2200 patterns, 1700 being
simple and 500 complex.
EXPERIMENT 1
METHOD
Five clinicians interviewed both the old (PARRY1) and
new (PARRY2) versions of the model without knowing which was which.
All five agreed PARRY2 showed greater linguistic comprehension.
To obtain a more precise estimate, 19 graduate students were
paid to rate transcripts of these interviews. They rated each
I-O pair of each interview along a dimension of "linguistic
comprehension" ("Did the patient understand what the doctor
said?") on a 0-9 scale.
RESULTS
In the 10 interviews there was a total of %%%% I-O pairs.
On a 0-9 scale of linguistic comprehension, the mean rating of
PARRY1 was 5.256 and the mean rating of PARRY2 was 5.483. This
difference is significant at the 0.05 level (t=1.0935,
one-tailed test).
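For readers who wish to reproduce this kind of comparison, the
following sketch shows a one-tailed t-test on per-pair ratings.
The rating lists are invented placeholders, not the experimental
data, and the unequal-variance assumption is ours; the original
analysis may have differed.

    # Hypothetical illustration only: the rating lists below are
    # invented placeholders, not the experimental data.
    from scipy import stats

    parry1 = [5.1, 4.8, 5.6, 5.3, 5.0, 5.4]
    parry2 = [5.5, 5.2, 5.9, 5.4, 5.6, 5.7]

    t, p_two = stats.ttest_ind(parry2, parry1, equal_var=False)
    # halve the two-sided p for a one-tailed test in the
    # predicted direction
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    print(f"t = {t:.4f}, one-tailed p = {p_one:.4f}")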
These raters also rated transcripts of the original
eight interviews conducted by psychiatrists with PARRY1 and
with paranoid patients. PARRY1 received a mean rating of 5.19 and
the patients 7.42. The difference is significant at the 0.001
level. This confirms the original test using psychiatrists
as raters. (Frank---how does it?)
The student raters gave PARRY1 in the original interviews
a mean rating of 5.19 and a mean rating of 5.26 in the experiment
under discussion. The difference is not statistically significant
(SD(difference) = 0.1497, t = 0.45, p < 0.80). We conclude that the
student raters are reliable and that PARRY1 elicits consistent
ratings from the two groups of raters.
DISCUSSION
The improvement of PARRY2 over PARRY1 along the dimension of
linguistic comprehension (that is, movement toward the ratings
received by patients) is statistically significant. However, PARRY2's
rating
of 5.48 is still distant from the rating of 7.42 received by the
patients. How close should a simulation model come to its natural
counterpart? Everybody knows that nobody knows. Perhaps we have
reached the limit of approximation. Intuitively it seemed the model
should be able to do better if we could pinpoint its most serious
inadequacies.
We looked at each I-O pair which received a mean rating
of 5.0 or less. There were %%%% such cases. In %%% of these cases
the pattern was recognized but, due to our own errors, the pointers
pointed to the wrong response functions. In the %%% remaining cases,
the pattern was not recognized. We corrected the pointers and then
repeated the experiment using five different clinicians who interviewed
PARRY1 and PARRY2.
EXPERIMENT 2